Abstract

Many problems in bioinformatics are about finding strings that approximately represent a 
collection of given strings. We look at more general problems where some input strings can be 
classified as outliers. The Close to Most Strings problem is, given a set S of same-length strings, 
and a parameter d, find a string x that maximizes the number of "non-outliers" within Hamming 
distance d of x. We prove this problem has no PTAS unless ZPP = NP, correcting a decade-old 
mistake. The Most Strings with Few Bad Columns problem is to find a maximum-size subset 
of input strings so that the number of non-identical positions is at most k; we show it has no 
PTAS unless P = NP. We also observe Closest to k Strings has no EPTAS unless W[1] = FPT.
In sum, outliers help model problems associated with using biological data, but we show the 
problem of finding an approximate solution is computationally difficult. 

1 Introduction 

With the development of high-throughput next-generation sequencing technologies, large amounts
of genomic data have become available, along with an increased need for novel ways to analyze
this data. This has inspired numerous formulations of biological tasks as computational problems.
In light of this observation, Lanctot et al. [15] initiated the study of distinguishing string
selection problems, where we seek a representative string satisfying some distance constraints
from each of the input strings. We will mostly have constraints in the form of an upper bound on
the Hamming distance, but lower bounds on the Hamming distance, and substring distances, have
also been considered [6, 11, 15].

Typically, the distance constraint must be satisfied for each of the input strings. However, 
biological sequence data is subject to frequent random mutations and errors, particularly in specific 
segments of the data; requiring that the solution fits the entire input data is problematic for many 
problems in bioinformatics. It would be preferable to find the similarity of a portion of the input 
strings, excluding a few bad reads that have been corrupted, rather than trying to fit the complete 
set of input and in doing so finding one that is distant from many or all of the strings. 
What if we are given a measure of goodness (e.g., distance) the representative must satisfy,
and want to choose the largest subset of strings with such a representative? Conversely, what if 
we specify the subset size and seek a representative that is as good as possible? Some results are 
known in this area with respect to fixed-parameter tractability [5]. Here, we prove results about 
the approximability of three string selection problems with outliers. For any two strings $x$ and
$y$ of the same length, we denote the Hamming distance between them by $d(x, y)$, which is defined
as the number of mismatched positions. Our main results are about three NP optimization problems.

Definition 1. Close to Most Strings (a.k.a. Max Close String)
Input: $n$ strings $S = \{s_1, \ldots, s_n\}$ of length $\ell$ over an alphabet $\Sigma$, and $d \in \mathbb{Z}_+$.
Solution: a string $s$ of length $\ell$.
Objective: maximize the number of strings $s_i$ in $S$ that satisfy $d(s, s_i) \le d$.

Definition 2. Closest to k Strings
Input: $n$ strings $S = \{s_1, \ldots, s_n\}$ of length $\ell$ over an alphabet $\Sigma$, and $k \in \mathbb{Z}_+$.
Solution: a string $s$ of length $\ell$ and a subset $S^*$ of $S$ of size $k$.
Objective: minimize $\max\{d(s, s_i) \mid s_i \in S^*\}$.
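
As an informal illustration of Definitions 1 and 2, the following minimal Python sketch evaluates both objectives for a given candidate string. The function names and the exhaustive search are our own illustrative choices (not part of the problem definitions), and the search is only feasible for tiny instances.

```python
from itertools import product

def hamming(x, y):
    """Number of mismatched positions between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(x, y))

def close_to_most_value(s, strings, d):
    """Close to Most Strings objective: how many inputs lie within distance d of s."""
    return sum(hamming(s, t) <= d for t in strings)

def closest_to_k_value(s, strings, k):
    """Closest to k Strings objective for a fixed s. The best subset S* consists of
    the k inputs nearest to s, so the value is the k-th smallest distance."""
    return sorted(hamming(s, t) for t in strings)[k - 1]

def brute_force_close_to_most(strings, d, alphabet):
    """Exhaustive search over all |alphabet|^length candidates (tiny instances only)."""
    length = len(strings[0])
    candidates = ("".join(c) for c in product(alphabet, repeat=length))
    return max(candidates, key=lambda s: close_to_most_value(s, strings, d))
```

For the three strings 000, 001, 111 and d = 1, no candidate can be within distance 1 of both 000 and 111, so the optimum of Close to Most Strings is 2.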

In the special case k = n, Closest to k Strings becomes Closest String — an NP-hard problem [9] 
that has received significant interest in parameterized complexity and approximability [H [2l [TU[ \TE\ 

We also consider a problem where the "outliers" are considered to be positions ("columns")
rather than strings ("rows"). Let $s(j)$ indicate the $j$th character of string $s$.

Definition 3. Most Strings with Few Bad Columns
Input: $n$ strings $S = \{s_1, \ldots, s_n\}$ of length $\ell$ over an alphabet $\Sigma$, and $k \in \mathbb{Z}_+$.
Solution: a subset $S^* \subseteq S$ of strings such that the number
$|\{t \in [\ell] \mid \exists s_1^*, s_2^* \in S^* : s_1^*(t) \ne s_2^*(t)\}|$ of bad columns is at most $k$.
Objective: maximize $|S^*|$.

In other words, a column $t$ is bad when its entries are not all equal among the strings in $S^*$.
The Most Strings with Few Bad Columns Problem generalizes the problem of finding tandem repeats
in a string [16].
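
To make Definition 3 concrete, here is a small Python sketch that lists the bad columns of a subset and finds a maximum feasible subset by exhaustive search. The helper names are hypothetical, columns are indexed from 0, and the exponential search is meant only for toy inputs.

```python
from itertools import combinations

def bad_columns(subset):
    """Indices t (0-based) at which the strings in `subset` are not all equal."""
    return [t for t, column in enumerate(zip(*subset)) if len(set(column)) > 1]

def most_strings_few_bad_columns(strings, k):
    """Largest subset with at most k bad columns, found by trying subsets
    from the largest size downward (exponential time)."""
    for size in range(len(strings), 0, -1):
        for subset in combinations(strings, size):
            if len(bad_columns(subset)) <= k:
                return list(subset)
    return []
```

For the strings 0101, 0111, 0100, 1010, the first three have exactly two bad columns, so the optimum is 3 for k = 2 and drops to 2 for k = 1.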

1.1 Our contributions 

A PTAS for a minimization problem is an algorithm that takes an instance of the problem and a
parameter $\epsilon > 0$ and, in time that is polynomial for any fixed $\epsilon$, produces a solution
that is within a factor $1 + \epsilon$ of being optimal. An efficient PTAS (EPTAS) further restricts
the running time to be some function of $\epsilon$ times a constant-degree polynomial in the input
size. We present several results on the computational hardness of efficiently finding an approximate
solution to the above optimization problems. Specifically, we show the following:

• The Close to Most Strings Problem has no PTAS, unless ZPP = NP (Theorem 1).

• The Most Strings with Few Bad Columns Problem has no PTAS, unless P = NP (Theorem 2).

• We observe that the known PTAS [20] for the Closest to k Strings Problem cannot be improved
to an EPTAS, unless W[1] = FPT.
Our first result corrects an error in prior literature. A problem is APX-hard if for some fixed
$\epsilon > 0$, finding a $(1 + \epsilon)$-approximation is NP-hard. A 2000 paper of Ma [20] claims that
the Close to Most Strings problem is APX-hard; however, the reduction is erroneous. To explain, it
is helpful to define one more problem: Far from Most Strings, which is the same as Close to Most
Strings except that we want to maximize the number of strings $s_i$ in $S$ that satisfy
$d(s, s_i) \ge d$ (rather than $\le$). There is considerable experimental interest in heuristics for
Far from Most Strings, mostly based on local search [19, 7, 8]. Far from Most Strings was introduced
and studied by Lanctot et al. [15], and they (correctly) showed that for any fixed alphabet size
greater than or equal to three, Far from Most Strings is at least as hard to approximate as
Independent Set. Currently, Independent Set is
known [14] to be inapproximable within a factor of $n / 2^{(\log n)^{3/4+\gamma}}$ unless NP $\subseteq$ BPTIME($2^{(\log n)^{O(1)}}$).

The main idea in Ma's approach was to consider a binary alphabet. In detail, the Far from Most
Strings and Close to Most Strings Problems on the alphabet $\Sigma = \{0, 1\}$ are basically the same
problem, since a string $s$ of length $\ell$ has distance at most $d$ from $s_i$ if and only if the
complementary string $\bar{s}$ has distance at least $\ell - d$ from $s_i$. The crucial error in [20]
is that Ma mis-cited [15], assuming that their result worked on binary alphabets. (One reason why
the approach of [15] does not extend to binary alphabets in any obvious way is that the instances
produced by their reduction satisfy $d = \ell$, whereas Far from Most Strings is easy to solve when
$|\Sigma| = 2$ and $d = \ell$.)
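
The complementation argument rests on the identity $d(\bar{s}, t) = \ell - d(s, t)$ for binary strings. The following Python sketch, with helper names of our own choosing, checks the resulting equivalence exhaustively for a small length; it illustrates the identity and plays no role in any proof.

```python
from itertools import product

def hamming(x, y):
    """Number of mismatched positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))

def complement(s):
    """Flip every bit of a binary string."""
    return "".join("1" if ch == "0" else "0" for ch in s)

def close_iff_complement_far(length, d):
    """Check, over all pairs of binary strings of the given length, that
    s is within distance d of t exactly when complement(s) is at distance
    at least length - d from t."""
    words = ["".join(bits) for bits in product("01", repeat=length)]
    return all(
        (hamming(s, t) <= d) == (hamming(complement(s), t) >= length - d)
        for s in words
        for t in words
    )
```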

From [12] and [20] we cannot conclude anything about the hardness of Close to Most Strings,
nor can we say anything about the hardness of Far from Most Strings when $|\Sigma| = 2$. Our results
close both of these gaps: the proof of Theorem 1 actually shows Close to Most Strings is hard over
a binary alphabet, from which it follows that Far from Most Strings is, too. At the same time,
the hardness that we are able to achieve is much more modest than the previous claim; we show
only that there is no 1.001-approximation. We also require a randomized reduction. It is a very
interesting open problem to determine whether this problem has any constant-factor approximation,
even over a binary alphabet.

1.2 Brief Description of Parameterized Complexity 

Some parameterized complexity concepts will arise in later sections, so we give a bird's-eye view
of this area. With respect to a parameter $k$, a decision algorithm with running time $f(k) n^{O(1)}$
(where $n$ is the input length) is called fixed-parameter tractable (FPT); the class FPT contains all
parameterized problems with FPT algorithms. The corresponding reduction notion between two
parameterized problems is an FPT reduction, which runs in FPT time and increases the parameter by
at most some function that is independent of the instance size. The class W[1] is a superset of FPT
closed under FPT reductions. A problem is W[1]-hard if any W[1] problem can be FPT-reduced to it,
and W[1]-complete if it is both in W[1] and W[1]-hard. There are many natural W[1]-complete
problems, like Maximum Clique parameterized by clique size. It is widely hypothesized, but
unproven, that FPT $\subsetneq$ W[1], analogous to P $\subsetneq$ NP.

2 Approximation Hardness of Close to Most Strings 

Theorem 1. For some $\epsilon > 0$, if there is a polynomial-time $(1 + \epsilon)$-approximation
algorithm for the Close to Most Strings Problem, then ZPP = NP.

Proof. We use a reduction from the Max-2-SAT Problem, which is to determine, for a given 2-CNF
formula, an assignment that satisfies the maximum number of clauses. Let $X = \{x_1, \ldots, x_n\}$ be
a set of Boolean variables. In 2-CNF, each clause is a disjunction of two literals, each of which is
either $x_i$ or $\bar{x}_i$ for some $i$. Håstad [12] showed it is NP-hard to compute a
22/21-approximately optimal solution to Max-2-SAT, and this is the starting point for our proof. We
will assume that $m \ge n$, i.e. the number of clauses is greater than or equal to the number of
variables, which is without loss of generality since otherwise some variable appears in at most one
clause and the instance can be reduced.

Figure 1: Overview of the reduction used to prove Theorem 1.

We give a schematic overview of our reduction in Figure 1. The reduction will be randomized.
It takes as input an instance of Max-2-SAT with $m$ clauses and $n$ variables. The reduction's output
is an instance of Close to Most Strings with $cm + m$ strings of length $2n$ for some constant $c$, and
the distance parameter of the instance is $d = n$. Of these strings, $cm$ will be "fixing" strings to
enforce a certain structure in near-optimal solutions, and the remaining $m$ strings are defined from
the clauses as follows. Given a 2-clause $\omega_j$ over the variables in $X$, we define the
corresponding string $s_j = s_j(1) \ldots s_j(2n)$ blockwise, for $i = 1, \ldots, n$:

$$ s_j(2i-1)\,s_j(2i) = \begin{cases} 00 & \text{if } \omega_j \text{ contains the literal } \bar{x}_i, \\ 11 & \text{if } \omega_j \text{ contains the literal } x_i, \\ 01 & \text{otherwise.} \end{cases} $$

The fixing strings will all be elements of $\{01, 10\}^n$, selected independently and uniformly at random.

We now give a high-level explanation of the proof. For every variable assignment vector $x$ define
a string $\hat{x}$ via

$$ \hat{x}(2i-1)\,\hat{x}(2i) = \begin{cases} 11 & \text{if } x_i \text{ is true,} \\ 00 & \text{if } x_i \text{ is false.} \end{cases} $$
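
The encoding can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: clauses are pairs of nonzero integers over two distinct variables (a positive integer i stands for $x_i$, a negative one for its negation), the block for a positive literal is taken to be 11 and for a negated literal 00 (the orientation under which a satisfied literal contributes distance 0), and all function names are hypothetical. The last function checks the two distance facts used in the proof: every assignment string is at distance exactly $n$ from every fixing string, and it is within distance $n$ of a clause string exactly when the assignment satisfies that clause.

```python
import random
from itertools import product

def hamming(x, y):
    """Number of mismatched positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))

def clause_string(clause, n):
    """Block i is 11 for literal x_i, 00 for its negation, 01 otherwise."""
    blocks = ["01"] * n
    for lit in clause:
        blocks[abs(lit) - 1] = "11" if lit > 0 else "00"
    return "".join(blocks)

def fixing_strings(count, n, rng):
    """Independent uniform samples from {01, 10}^n."""
    return ["".join(rng.choice(("01", "10")) for _ in range(n)) for _ in range(count)]

def assignment_string(assignment):
    """Block i is 11 if x_i is true and 00 if x_i is false."""
    return "".join("11" if value else "00" for value in assignment)

def satisfies(assignment, clause):
    """True if at least one literal of the clause holds under the assignment."""
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)

def encoding_is_consistent(clauses, n, c=20, seed=0):
    """Exhaustively check both distance facts for every assignment."""
    rng = random.Random(seed)
    fixing = fixing_strings(c * len(clauses), n, rng)
    clause_strs = [clause_string(cl, n) for cl in clauses]
    for assignment in product([False, True], repeat=n):
        xh = assignment_string(assignment)
        if any(hamming(xh, f) != n for f in fixing):
            return False
        for cl, s in zip(clauses, clause_strs):
            if (hamming(xh, s) <= n) != satisfies(assignment, cl):
                return False
    return True
```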

Notice that $\hat{x}$ is at distance exactly $d = n$ from all of the fixing strings, and that
$d(\hat{x}, s_j) \le n$ if and only if $x$ satisfies clause $\omega_j$. Hence, if $x$ satisfies $k$
clauses, the string $\hat{x}$ is within distance $d$ of $cm + k$ out of the $cm + m$ total strings. We
will show conversely that with high probability, every string $s$ within distance $d$ of more than
$cm$ of the strings satisfies $s \in \{00, 11\}^n$. Using this crucial structural claim, it follows
that any sufficiently good approximation algorithm for Close to Most Strings must output $s$ such
that $s = \hat{x}$ for some $x$. Then the proof will be completed via standard calculations.
Here is the precise statement of the structural property. 

Lemma 1. For $c \ge 20$, the following holds. Let $F$ be a set of $cm$ strings selected uniformly and
independently at random from $\{01, 10\}^n$ (with replacement), with $m \ge n$. Then with probability
at least $1 - 0.9^n$, every string $s \in \{0, 1\}^{2n} \setminus \{00, 11\}^n$ has distance greater
than $n$ from at least $m$ strings in $F$.

Proof. To explain the proof more simply, fix $s$ and consider a particular $f \in F$. By hypothesis,
this $s$ satisfies $s(2i-1) \ne s(2i)$ for some $i$, say $s(2i-1) = 0$ and $s(2i) = 1$ (the other case
is symmetric). Since $f$ is chosen uniformly at random from $\{01, 10\}^n$, the event $\mathcal{E}$
where $f(2i-1) = 1$ and $f(2i) = 0$ has $\Pr[\mathcal{E}] = 1/2$. A short calculation which we postpone
momentarily shows that $\Pr[d(s, f) \ge n + 1 \mid \mathcal{E}] \ge 1/2$. So unconditioning,
$\Pr[d(s, f) \ge n + 1] \ge \Pr[d(s, f) \ge n + 1 \mid \mathcal{E}] \cdot \Pr[\mathcal{E}] \ge 1/4$.

Let us verify now that $\Pr[d(s, f) \ge n + 1 \mid \mathcal{E}] \ge 1/2$. Observe that $d(s, f)$ is a
sum of $n$ independent random variables $d(s(2j-1)s(2j), f(2j-1)f(2j))$ for $j$ from 1 to $n$;
conditioning on $\mathcal{E}$ just fixes one of these variables at 2. The remaining ones are either
always 1 (if $s(2j-1) = s(2j)$), or a uniformly random element of $\{0, 2\}$. The conditioned random
variable $d(s, f) \mid \mathcal{E}$ is thus a shifted and scaled binomial distribution; in particular
it is symmetric about $n + 1$. So
$\Pr[d(s, f) \ge n + 1 \mid \mathcal{E}] = \Pr[d(s, f) \le n + 1 \mid \mathcal{E}]$ and since these two
probabilities' sum is at least 1, $\Pr[d(s, f) \ge n + 1 \mid \mathcal{E}] \ge 1/2$ follows.
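
As a sanity check on the bound $\Pr[d(s, f) \ge n + 1] \ge 1/4$, the following Python sketch enumerates, for a small $n$, every $s \notin \{00, 11\}^n$ and every $f \in \{01, 10\}^n$ and reports the smallest fraction of fixing strings at distance greater than $n$. The function name is our own, and the enumeration is exponential in $n$, so it is only meant for very small values.

```python
from fractions import Fraction
from itertools import product

def hamming(x, y):
    """Number of mismatched positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))

def min_far_fraction(n):
    """Minimum, over all s outside {00,11}^n, of the exact probability that a
    uniform f in {01,10}^n satisfies d(s, f) > n."""
    fixing = ["".join(blocks) for blocks in product(("01", "10"), repeat=n)]
    structured = {"".join(blocks) for blocks in product(("00", "11"), repeat=n)}
    worst = Fraction(1)
    for bits in product("01", repeat=2 * n):
        s = "".join(bits)
        if s in structured:
            continue
        far = sum(hamming(s, f) > n for f in fixing)
        worst = min(worst, Fraction(far, len(fixing)))
    return worst
```

For $n = 2$ the minimum is exactly 1/4, attained by strings with two mixed blocks such as 0101, so the constant 1/4 cannot be improved in general.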

We will continue reasoning about this fixed $s$, and use a Chernoff bound to get large enough
probability to work for all possible $s$. Let $F = \{f_1, \ldots, f_{cm}\}$ and let $X_i$ be an
indicator variable for the event that $d(f_i, s) > n$. We have argued that each $X_i$ is 1 with
probability at least 1/4. Therefore, $E[\sum_i X_i] \ge cm/4$. We will use a Chernoff bound of the
following form:

Claim 1 (Lower Chernoff bound, [21]). If $X$ is a sum of independent 0-1 random variables, then
we have

$$ \Pr[X < (1 - \delta) E[X]] < \exp(-E[X] \delta^2 / 2). $$

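Choose $\delta$ so that $(1 - \delta) cm/4 = m$, i.e. $\delta = 1 - 4/c$. Then, with $X = \sum_i X_i$ and
using $E[X] \ge cm/4$,
$\Pr[X < m] \le \Pr[X < (1 - \delta) E[X]] < \exp(-E[X] \delta^2 / 2) \le \exp(-\frac{cm}{4} \cdot (1 - 4/c)^2 / 2) = \exp(-\frac{(c-4)^2}{8c} m)$.
By a union bound over all $4^n - 2^n$ possible choices of $s$, and using $m \ge n$, the probability
that a random choice of $F$ admits any bad $s$ is at most

$$ (4^n - 2^n) \exp\left(-\frac{(c-4)^2}{8c} m\right) < 4^n \exp\left(-\frac{(c-4)^2}{8c} n\right) = \exp\left(\left(\ln 4 - \frac{(c-4)^2}{8c}\right) n\right). $$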

Any large enough $c$ makes this exponentially decreasing in $n$; it is straightforward to calculate
that when $c = 20$ this is at most $0.9^n$, as needed. □
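
The final numerical step can be checked directly. The short Python sketch below (the function name is ours) evaluates the base $b = \exp(\ln 4 - (c-4)^2/(8c))$, so that the union bound reads "failure probability at most $b^n$".

```python
import math

def union_bound_base(c):
    """Return b such that the union bound in the proof is at most b**n."""
    return math.exp(math.log(4) - (c - 4) ** 2 / (8 * c))
```

For $c = 20$ the exponent is $\ln 4 - 1.6$, so the base is roughly 0.81, comfortably below 0.9. For $c = 18$ the base exceeds 1 and the bound becomes vacuous, which shows that this argument does need a reasonably large constant.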

Now let us complete the overall proof; fix $c = 20$. Given a Max-2-SAT instance, we run
the randomized reduction above to get an instance of Close to Most Strings. Let $s_A$ be a
$(1 + \epsilon)$-approximation for this instance, where $\epsilon$ will be a small constant fixed later
to satisfy two properties.

Let k* be the maximum number of satisfiable clauses in the Max-2-SAT instance. As an 
important technicality, note that k* is lower-bounded by m/2, since the expected number of clauses 
satisfied by a random assignment is at least m/2, by linearity of expectation. So the optimal solution 
to the Close to Most Strings instance has value at least cm + m/2. 

First we want to use the structural lemma (Lemma 1). Assume for now the bad event with
probability at most $0.9^n$ does not happen; so every $s \notin \{00, 11\}^n$ (i.e. not of the form
$s = \hat{x}$) is within distance $d$ of at most $cm$ of the $(c + 1)m$ strings. Thus provided that
$\epsilon$ is small enough to satisfy

$$ 1 + \epsilon < \frac{cm + m/2}{cm} = 1 + \frac{1}{2c}, $$

$s_A$ is of the form $\hat{x}_A$ for some $x_A$.

Next we finish the typical calculations in a proof of APX-hardness. We know that $s_A$ is within
distance $d$ of at least $(cm + k^*)/(1 + \epsilon)$ strings. If we can pick $\epsilon$ so that

$$ \frac{cm + k^*}{1 + \epsilon} > cm + \frac{21}{22} k^* \qquad (1) $$

then $x_A$ satisfies more than $\frac{21}{22} k^*$ clauses, which is NP-hard by Håstad's result. Using
that $k^* \ge m/2$, it is easy to verify that (1) holds for all $\epsilon < 1/(21 + 44c)$.
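
The verification of (1) amounts to rearranging it as $\epsilon < k^* / (22cm + 21k^*)$ and then using $m \le 2k^*$. The following Python sketch, using exact rational arithmetic and hypothetical function names, checks inequality (1) over a range of small parameters for a value of $\epsilon$ just below $1/(21 + 44c)$.

```python
from fractions import Fraction

def inequality_one_holds(c, m, k_star, eps):
    """Exact test of (cm + k*) / (1 + eps) > cm + (21/22) k*."""
    lhs = Fraction(c * m + k_star) / (1 + eps)
    rhs = c * m + Fraction(21, 22) * k_star
    return lhs > rhs

def check_threshold(c=20, max_m=40):
    """Check (1) for all 1 <= m <= max_m and all integers k* with m/2 <= k* <= m,
    taking eps slightly below the threshold 1 / (21 + 44c)."""
    eps = Fraction(1, 21 + 44 * c) * Fraction(999, 1000)
    return all(
        inequality_one_holds(c, m, k_star, eps)
        for m in range(1, max_m + 1)
        for k_star in range((m + 1) // 2, m + 1)
    )
```

At $\epsilon = 1/(21 + 44c)$ exactly and $k^* = m/2$, the two sides of (1) coincide, which is why the threshold is stated with a strict inequality.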

Finally, we confirm that the randomized algorithm for Max-2-SAT coming from the reduction is
ZPP-style, i.e. Las Vegas style. As long as the output $s_A$ of the Close to Most Strings
approximation algorithm satisfies $s_A \notin \{00, 11\}^n$ we re-create the reduction again using
fresh random bits and re-run the approximation algorithm. But once $s_A \in \{00, 11\}^n$ we know for
certain that $x_A$ is a 22/21-approximate solution for Max-2-SAT, as needed. The expected number of
trials is at most $1/(1 - 0.9^n) = O(1)$. □



3 Non-existence of an EPTAS for Closest to k Strings 

Ma showed in [20] that the Closest to k Strings problem has a PTAS, which contrasts with the
APX-hardness we obtain for the other problems in this paper. A natural question that comes up after
a PTAS is obtained is whether the running time can be improved to an EPTAS, or even further
to an FPTAS (running time polynomial in the input length and $\epsilon^{-1}$). We observe there does
not exist an EPTAS for Closest to k Strings when the alphabet is unbounded, unless W[1] = FPT. To
see this, we use a well-known fact relating fixed-parameter algorithms to the notion of an EPTAS,
e.g. see [18], along with the fact that the decision version of Closest to k Strings is W[1]-hard
when parameterized by $d$ [5].

In detail, suppose for the sake of contradiction that we had an EPTAS for Closest to k Strings,
i.e. that one could obtain a $(1 + \epsilon)$-approximation in time $f(\epsilon) s^{O(1)}$ where $s$ is
the input size. It is enough to prove that there is an FPT algorithm for the decision version of
Closest to k Strings, with parameter $d$. Given an instance of this parameterized problem we need
only call the EPTAS with any $\epsilon$ less than $1/d$ (equivalently, $1 + \epsilon < (d + 1)/d$), for
example $\epsilon = 1/(d + 1)$; notice the resulting algorithm takes FPT time with respect to $d$.
To analyze this, let $d_{ALG}$ be the distance value of the solution produced by the EPTAS algorithm,
and $d_{OPT}$ be the optimal distance value. If $d_{OPT} \le d$, since
$d_{OPT} \le d_{ALG} \le (1 + \epsilon) d_{OPT} < d_{OPT} + 1$ and $d_{OPT}, d_{ALG} \in \mathbb{Z}$, we have
$d_{OPT} = d_{ALG} \le d$. Otherwise, $d_{ALG} \ge d_{OPT} > d$. So, we get an FPT algorithm just by
comparing $d_{ALG}$ to $d$.
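
The argument can be sketched in Python as a thin wrapper around an approximation scheme. Since no EPTAS is available (and none is expected to exist), the `sloppy_scheme` below is a purely hypothetical stand-in: it computes the optimum by brute force and then returns the largest integer value that a $(1 + \epsilon)$-approximation would be allowed to report. The choice $\epsilon = 1/(d + 1)$ and all names are illustrative assumptions.

```python
import math
from fractions import Fraction
from itertools import product

def hamming(x, y):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))

def optimum_radius(strings, k, alphabet):
    """Exact optimum of Closest to k Strings by exhaustive search (tiny inputs only)."""
    length = len(strings[0])
    return min(
        sorted(hamming(cand, t) for t in strings)[k - 1]
        for cand in product(alphabet, repeat=length)
    )

def sloppy_scheme(strings, k, eps, alphabet="01"):
    """Stand-in for an approximation scheme: the worst integer value d_ALG
    satisfying d_OPT <= d_ALG <= (1 + eps) * d_OPT."""
    opt = optimum_radius(strings, k, alphabet)
    return math.floor((1 + eps) * opt)

def decide(strings, k, d, scheme=sloppy_scheme):
    """Decide whether some string is within distance d of at least k inputs,
    using one call to the scheme with eps = 1/(d+1) < 1/d."""
    eps = Fraction(1, d + 1)
    return scheme(strings, k, eps) <= d
```

Even with the worst admissible answer from the scheme, the comparison of $d_{ALG}$ with $d$ gives the correct decision, because $(1 + \epsilon) d_{OPT} < d_{OPT} + 1$ whenever $d_{OPT} \le d$.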

Observation 1. Closest to k Strings has no EPTAS unless W[1] = FPT.



4 APX-Hardness of Most Strings with Few Bad Columns 

In this section, we prove that the Most Strings with Few Bad Columns Problem is APX-hard, even
over a binary alphabet. To do this we reduce from the Densest-k-Subgraph Problem: given a graph
$G = (V, E)$ and a parameter $k$, find a subset $U \subseteq V$ with $|U| = k$ such that $|E[U]|$ is
maximized. Here $E[U]$ denotes the induced edges for $U$, meaning the set of all edges with both
endpoints in $U$.

Figure 2: Example of the reduction from an instance of Densest-k-Subgraph with $G$ and $k = 3$ to
an instance of Most Strings with Few Bad Columns with 6 strings of length 5.

Our reduction will be approximation-preserving up to an additive $+1$ term. Given an instance
$(G = (V, E), k)$ of Densest-k-Subgraph, we will generate an instance of Most Strings with Few Bad
Columns with $|E| + 1$ strings, each of length $|V|$, and with the same value for the two parameters
$k$ (size of subgraph, maximum number of bad columns).

Let us define the set $S$ of strings generated by the reduction; to do this, index
$V = \{v_1, v_2, \ldots\}$. For each edge $e = v_i v_j \in E$, let that edge's 0-1 incidence vector
$\chi(e)$ be the 0-1 string with 1s in positions $i$ and $j$ and 0s elsewhere; we put $\chi(e)$ into
$S$. Finally, we put one more string into $S$, namely the all-zero string $\mathbf{0}$. This completes
the description of the reduction; note it only takes polynomial time. See Figure 2 for an
illustration of this reduction.
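
A minimal Python sketch of this reduction follows. Vertices are indexed from 0 rather than 1, edges are given as pairs of vertex indices, and the function names are our own; the example graph used in the usage note is hypothetical and is not claimed to be the graph of Figure 2.

```python
def incidence_string(edge, num_vertices):
    """0-1 incidence vector of an edge: 1s at its two endpoints, 0s elsewhere."""
    i, j = edge
    bits = ["0"] * num_vertices
    bits[i] = bits[j] = "1"
    return "".join(bits)

def reduce_densest_k_subgraph(num_vertices, edges):
    """Strings of the Most Strings with Few Bad Columns instance: one incidence
    string per edge, plus the all-zero string. The parameter k is unchanged."""
    strings = [incidence_string(e, num_vertices) for e in edges]
    strings.append("0" * num_vertices)
    return strings
```

For a graph on 5 vertices with edges (0,1), (1,2), (0,2), (2,3), (3,4), the reduction produces 6 strings of length 5, matching the instance size $|E| + 1$ by $|V|$ described above.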

Claim 2. Let $\alpha$ be the optimal value for the Densest-k-Subgraph instance. Then the optimal value
$\beta$ for the new Most Strings with Few Bad Columns instance is $\beta = \alpha + 1$.

Proof. First we show the easy direction, that $\beta \ge \alpha + 1$. Consider the optimal $U$ for
Densest-k-Subgraph, so that $|E[U]| = \alpha$ and $|U| = k$. Define a subset $T$ of $S$ by
$T = \{\mathbf{0}\} \cup \{\chi(e) \mid e \in E[U]\}$. Then the strings in $T$ are all zero on any index
corresponding to a node outside of $U$; the only bad columns are those corresponding to nodes in
$U$, of which there are only $k$. So $\beta \ge |T| = \alpha + 1$.

For the reverse direction, take a subset $T$ of $\beta$ strings that have at most $k$ bad columns. We
can assume without loss of generality that the string $\mathbf{0}$ is in $T$, as the following
structural lemma shows.

Lemma 2. Let $T \subseteq S$ be a subset of strings with at most $k$ bad columns. Then there is a
subset $T'$ of $S$ with at most $k$ bad columns, $|T'| \ge |T|$, and $\mathbf{0} \in T'$.

Assume for the moment that the lemma is true. Then we simply reverse the above reduction
to show $\alpha \ge \beta - 1$. Take an optimal set $S^*$ of strings with $|S^*| = \beta$ and such that
$S^*$ has at most $k$ bad columns. By Lemma 2 we may assume $\mathbf{0} \in S^*$; this implies that the
set $J$ of all non-bad columns for $S^*$ satisfies $s(j) = 0$ for all $s \in S^*$, $j \in J$. Thus, each
$\chi(uv) \in S^* \setminus \{\mathbf{0}\}$ has both of its 1s appearing at positions in
$[\ell] \setminus J$, or equivalently each such $uv$ is an element of $E[V \setminus J]$. So
$V \setminus J$ (padded with arbitrary further vertices if it has fewer than $k$ elements) is the
required solution for Densest-k-Subgraph, and it has at least $\beta - 1$ induced edges.

Proof of Lemma 2. Assume that $\mathbf{0} \notin T$, otherwise the lemma trivially follows. Also,
assume $W = T \cup \{\mathbf{0}\}$ has more than $k$ bad columns, otherwise we can take $T' = W$. Thus
there must be a column that is not bad for $T$ but that becomes bad when adding $\mathbf{0}$, i.e. $T$
has a column that is entirely 1s. It follows that, viewed in the original graph setting, there
exists a vertex $v$ that is an endpoint of all the edges corresponding to $T$. Pick any such edge
arbitrarily, i.e. suppose $s = \chi(vw) \in T$. Since the input graph is simple, in column $w$, all
entries of $T$ are 0 except for $\chi(vw)$. Hence, $T' = (T \setminus \{s\}) \cup \{\mathbf{0}\}$
satisfies the lemma: compared with $T$ it is bad in column $v$ but not bad in column $w$. □

This ends the proof of Claim 2. □
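
Claim 2 can be checked by brute force on small graphs. The following self-contained Python sketch computes both optima exhaustively and compares them; the function names and the 0-based vertex indexing are our own conventions, and the running time is exponential, so this is only a sanity check on toy instances.

```python
from itertools import combinations

def densest_k_value(num_vertices, edges, k):
    """Maximum number of induced edges over all vertex subsets of size k."""
    return max(
        sum(u in chosen and v in chosen for u, v in edges)
        for chosen in map(set, combinations(range(num_vertices), k))
    )

def most_strings_value(strings, k):
    """Maximum size of a subset of strings with at most k bad columns."""
    for size in range(len(strings), 0, -1):
        for subset in combinations(strings, size):
            bad = sum(len(set(column)) > 1 for column in zip(*subset))
            if bad <= k:
                return size
    return 0

def claim_holds(num_vertices, edges, k):
    """Check beta == alpha + 1 for the instance produced by the reduction."""
    strings = [
        "".join("1" if v in e else "0" for v in range(num_vertices)) for e in edges
    ]
    strings.append("0" * num_vertices)
    return most_strings_value(strings, k) == densest_k_value(num_vertices, edges, k) + 1
```

On the hypothetical 5-vertex graph with edges (0,1), (1,2), (0,2), (2,3), (3,4), the densest 3-vertex subgraph is the triangle with 3 edges, and the largest subset of strings with at most 3 bad columns has size 4.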

This reduction yields our result: 

Theorem 2. The Most Strings with Few Bad Columns Problem is NP-hard, and APX-hard. 

Proof. Khot [13] showed that the Densest-k-Subgraph Problem is APX-hard. We need only to
argue that our reduction can transform a PTAS for Most Strings with Few Bad Columns into
a PTAS for Densest-k-Subgraph. Indeed, if we had a $(1 + \delta)$-approximation algorithm for Most
Strings with Few Bad Columns, then we get an algorithm for Densest-k-Subgraph that always
returns a solution of value at least

$$ \frac{OPT + 1}{1 + \delta} - 1 = \frac{OPT - \delta}{1 + \delta} \ge \frac{OPT (1 - \delta)}{1 + \delta} = \frac{OPT}{1 + O(\delta)}, $$

where we used $OPT \ge 1$ in the middle inequality. □

While we ruled out a PTAS, it would also be out of the reach of current techniques to obtain
a constant or polylogarithmic approximation factor for Most Strings with Few Bad Columns, because
the best known approximation factor for the Densest-k-Subgraph Problem is $O(|V|^{1/4+\epsilon})$.

5 Conclusions and Open Problems 

Our results demonstrate that while outliers help model the problems associated with using biological 
data, such problems are computationally intractable to approximate. Here are the main open 
problems related to our results: 

• Is there a constant-factor approximation for either Close to Most Strings or Most Strings with 
Few Bad Columns (even over a binary alphabet)? 

• Does there exist an EPTAS for the Closest String Problem? Since the Closest String Problem
is FPT with respect to $d$, the standard technique used in Section 3 cannot be used naively.

• Does there exist an EPTAS for the Closest to k Strings Problem over a bounded-size or binary 
alphabet? The reduction used in Section 3 needs an arbitrarily large alphabet.